Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.
This data is part of the openintro textbook and we can load and inspect it. There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:
?yrbss
data(yrbss)
glimpse(yrbss)## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender <chr> "female", "female", "female", "female", "fema…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race <chr> "Black or African American", "Black or Africa…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71.2, …
## $ helmet_12m <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
yrbss %>% mutate(race = factor(race))## # A tibble: 13,583 × 13
## age gender grade hispanic race height weight helmet_12m text_while_driv…
## <int> <chr> <chr> <chr> <fct> <dbl> <dbl> <chr> <chr>
## 1 14 female 9 not Black … NA NA never 0
## 2 14 female 9 not Black … NA NA never <NA>
## 3 15 female 9 hispanic Native… 1.73 84.4 never 30
## 4 15 female 9 not Black … 1.6 55.8 never 0
## 5 15 female 9 not Black … 1.5 46.7 did not r… did not drive
## 6 15 female 9 not Black … 1.57 67.1 did not r… did not drive
## 7 15 female 9 not Black … 1.65 132. did not r… <NA>
## 8 14 male 9 not Black … 1.88 71.2 never <NA>
## 9 15 male 9 not Black … 1.75 63.5 never <NA>
## 10 15 male 10 not Black … 1.37 97.1 did not r… <NA>
## # … with 13,573 more rows, and 4 more variables: physically_active_7d <int>,
## # hours_tv_per_school_day <chr>, strength_training_7d <int>,
## # school_night_hours_sleep <chr>
skimr::skim(yrbss)| Name | yrbss |
| Number of rows | 13583 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 12 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| grade | 79 | 0.99 | 1 | 5 | 0 | 5 | 0 |
| hispanic | 231 | 0.98 | 3 | 8 | 0 | 2 | 0 |
| race | 2805 | 0.79 | 5 | 41 | 0 | 5 | 0 |
| helmet_12m | 311 | 0.98 | 5 | 12 | 0 | 6 | 0 |
| text_while_driving_30d | 918 | 0.93 | 1 | 13 | 0 | 8 | 0 |
| hours_tv_per_school_day | 338 | 0.98 | 1 | 12 | 0 | 7 | 0 |
| school_night_hours_sleep | 1248 | 0.91 | 1 | 3 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 77 | 0.99 | 16.16 | 1.26 | 12.00 | 15.0 | 16.00 | 17.00 | 18.00 | ▁▂▅▅▇ |
| height | 1004 | 0.93 | 1.69 | 0.10 | 1.27 | 1.6 | 1.68 | 1.78 | 2.11 | ▁▅▇▃▁ |
| weight | 1004 | 0.93 | 67.91 | 16.90 | 29.94 | 56.2 | 64.41 | 76.20 | 180.99 | ▆▇▂▁▁ |
| physically_active_7d | 273 | 0.98 | 3.90 | 2.56 | 0.00 | 2.0 | 4.00 | 7.00 | 7.00 | ▆▂▅▃▇ |
| strength_training_7d | 1176 | 0.91 | 2.95 | 2.58 | 0.00 | 0.0 | 3.00 | 5.00 | 7.00 | ▇▂▅▂▅ |
Before you carry on with your analysis, it’s is always a good idea to check with skimr::skim() to get a feel for missing values, summary statistics of numerical variables, and a very rough histogram.
You will first start with analyzing the weight of participants in kilograms. Using visualization and summary statistics, describe the distribution of weights. How many observations are we missing weights from?
#number of missing values in the 'weight' column:
sumna <- sum(is.na(yrbss['weight']))
print(paste0("there are " , sumna ," missing values." ))## [1] "there are 1004 missing values."
#using mosaic's favstats
favstats(~weight, data=yrbss)## min Q1 median Q3 max mean sd n missing
## 29.9 56.2 64.4 76.2 181 67.9 16.9 12579 1004
#as it is also apparent in the favstat's function output, there are 1004 missing values in the weight column.
#histogram of the weight distribution
ggplot(yrbss, aes(x=weight))+
geom_histogram(bindwidth=5, color ="black", fill = "white")+
labs (title = "Histogram of Weigth Distribution",
y = "Count",
x = "Weight (in kilograms)")+
theme_bw()+
geom_vline(aes(xintercept = mean(weight, na.rm = TRUE)), color="blue", linetype="dashed", size=1)+
NULL#density plot of the weight distribution
ggplot(yrbss, aes(x=weight)) +
geom_density() +
labs (title = "Density Plot of Weight Distribution",
x = "Weight (in kilograms)")+
theme_bw()+
NULLNext, consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
Let’s create a new variable in the dataframe yrbss, called physical_3plus , which will be yes if they are physically active for at least 3 days a week, and no otherwise. You may also want to calculate the number and % of those who are and are not active for more than 3 days. Use the count() function and see if you get the same results as group_by()... summarise()
yrbss %>%
filter(!is.na(physically_active_7d))%>%
ggplot(aes(x = factor(physically_active_7d))) +
geom_boxplot(aes(y=weight))+
labs (title = "Distribution of Weight by Number of Active days in a week",
x = "Number of Active Days in a Week",
y = "weight (kg)") +
theme_bw() #adding the new column that shows if a student is physically active 3 or more days a week.
physical <- yrbss %>%
mutate(physical_3plus = case_when(physically_active_7d >= 3 ~ "yes",
physically_active_7d < 3 ~ "no"),
physical_3plus = factor(physical_3plus, levels = c("yes", "no")))
count_physical <- physical %>%
filter(!is.na(physical_3plus)) %>%
count(physical_3plus) %>%
mutate(prop = n/sum(n))
#5685, 2656
prop.test(count_physical$n[2] , count_physical$n[2] + count_physical$n[1])##
## 1-sample proportions test with continuity correction
##
## data: [ out of +count_physical$n out of count_physical$n[2]2 out of count_physical$n[1]
## X-squared = 1522, df = 1, p-value <2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.323 0.339
## sample estimates:
## p
## 0.331
Can you provide a 95% confidence interval for the population proportion of high schools that are NOT active 3 or more days per week?
TEAM ANSWER:
The 95 percent confidence interval according to the prop.test function is from 0.3228978 to 0.3389583.
data: [ out of +count_physical\(n out of count_physical\)n[2]2 out of count_physical$n[1] X-squared = 1522, df = 1, p-value <2e-16 alternative hypothesis: true p is not equal to 0.5 95 percent confidence interval: 0.323 0.339 sample estimates: p 0.331
Initial plotting of the data doesn’t suggest a strong correlation between weekly physical activity levels and weight.
Make a boxplot of physical_3plus vs. weight. Is there a relationship between these two variables? What did you expect and why?
#removing rows that contain NAs from the physical table...
physical <- physical %>%
filter(!is.na(physical_3plus) & !is.na(weight))
#box plot of physical_3plus vs weight:
ggplot(physical, aes(x=physical_3plus, y = weight)) +
geom_boxplot()+
labs (title = "Distribution of Weight, Based on weekly activity level",
x = "Working out 3 or more days: Yes or No",
y = "weight (kg)") +
theme_bw()TEAM ANSWER:
People who work out 3 days a week or more are slightly heavier on average. One might expect the opposite, but weight is not a great indicator of fitness levels as it varries widely with height, and the number of workout days might not indicate the intensity of the workouts either. We would expect a stronger negative correlation between bodyfat percentage and number of active days a week.
Boxplots show how the medians of the two distributions compare, but we can also compare the means of the distributions using either a confidence interval or a hypothesis test. Note that when we calculate the mean, SD, etc. weight in these groups using the mean function, we must ignore any missing values by setting the na.rm = TRUE.
set.seed(1234)
#calculating CI using formulas:
#physical_3plus == "yes
yes_result<- physical %>%
filter(physical_3plus == "yes") %>%
summarise( yes_mean = mean(weight, na.rm = TRUE),
yes_n = count(physical_3plus == "yes"),
yes_sd = sd(weight, na.rm = TRUE),
t_critical = qt(0.975, yes_n -1),
se_yes = yes_sd / sqrt(yes_n),
margin_of_error = t_critical * se_yes,
yes_low_CI= yes_mean - margin_of_error,
yes_high_CI= yes_mean + margin_of_error
)
print(paste0("the confidence interval is ", as.numeric(yes_result['yes_low_CI']), " - ", as.numeric(yes_result['yes_high_CI']) ))## [1] "the confidence interval is 68.094807893518 - 68.8021328880692"
#we can get a very similar result using the bootstrap approach for confidence intervals by infer package:
boot_dist1 <- physical %>%
filter(physical_3plus == "yes") %>%
specify(response = weight) %>%
# Generate bootstrap samples
generate(reps = 1000, type = "bootstrap") %>%
# Calculate mean of each bootstrap sample
calculate(stat = "mean")
boot_dist1## Response: weight (numeric)
## # A tibble: 1,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 68.2
## 2 2 68.3
## 3 3 68.4
## 4 4 68.4
## 5 5 68.8
## 6 6 68.0
## 7 7 68.5
## 8 8 68.4
## 9 9 68.6
## 10 10 68.6
## # … with 990 more rows
percentile_ci1 <- get_ci(boot_dist1, level = 0.95, type ="percentile")
#visualisation of the resulting bootstrap distribution and the CIs
visualize(boot_dist1) +
shade_ci(endpoints=percentile_ci1, fill="khaki") +
geom_vline(xintercept = yes_result$yes_low_CI, colour ="red") +
geom_vline(xintercept = yes_result$yes_high_CI, colour = "red") +
NULL#physical_3plus == "no"
no_result <-physical %>%
filter(!is.na(physical_3plus)) %>%
filter(physical_3plus == "no") %>%
summarise( no_mean = mean(weight, na.rm = TRUE),
no_n = count(physical_3plus == "no"),
no_sd = sd(weight, na.rm = TRUE),
t_critical = qt(0.975, no_n -1),
se_no= no_sd / sqrt(no_n),
margin_of_error = t_critical * se_no,
no_low_CI= no_mean - margin_of_error,
no_high_CI= no_mean + margin_of_error)
print(paste0("the confidence interval is ", as.numeric(no_result['no_low_CI']), " - ", as.numeric(no_result['no_high_CI']) ))## [1] "the confidence interval is 66.1286201312585 - 67.2191521213521"
#we can also get a very similar result using the bootstrap approach for confidence intervals by infer package:
boot_dist2 <- physical %>%
filter(!is.na(physical_3plus)) %>%
filter(physical_3plus == "no") %>%
specify(response = weight) %>%
# Generate bootstrap samples
generate(reps = 1000, type = "bootstrap") %>%
# Calculate mean of each bootstrap sample
calculate(stat = "mean")
percentile_ci <- get_ci(boot_dist2, level = 0.95, type ="percentile")
#visualisation of the resulting bootstrap distribution and the CIs
visualize(boot_dist2) +
shade_ci(endpoints=percentile_ci, fill="khaki") +
geom_vline(xintercept = no_result$no_low_CI, colour ="red") +
geom_vline(xintercept = no_result$no_high_CI, colour = "red") +
labs(title="Simulation-Based Bootsrap Distribution for No Answers") +
NULLThere is an observed difference of about 1.77kg (68.44 - 66.67), and we notice that the two confidence intervals do not overlap. It seems that the difference is at least 95% statistically significant. Let us also conduct a hypothesis test.
Write the null and alternative hypotheses for testing whether mean weights are different for those who exercise at least 3 times a week and those who don’t.
H0 = there is no difference in the weight of those who exercise 3 times a week and more compaed to those who do not. h1 = there is a difference in the weight of those who exercise 3 times a week and more and those who do not.
t.test(weight ~ physical_3plus, data = physical)##
## Welch Two Sample t-test
##
## data: weight by physical_3plus
## t = 5, df = 7479, p-value = 9e-08
## alternative hypothesis: true difference in means between group yes and group no is not equal to 0
## 95 percent confidence interval:
## 1.12 2.42
## sample estimates:
## mean in group yes mean in group no
## 68.4 66.7
Since the T value is 5 we can reject the NULL hypothesis, which means that we statistically sufficient evidence to say that there is a difference in the mean weight of those who work out more than 3 days a week and those who don’t.
inferNext, we will introduce a new function, hypothesize, that falls into the infer workflow. You will use this method for conducting hypothesis tests.
But first, we need to initialize the test, which we will save as obs_diff.
obs_diff <- physical %>%
specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
obs_diff## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 1.77
Notice how you can use the functions specify and calculate again like you did for calculating confidence intervals. Here, though, the statistic you are searching for is the difference in means, with the order being yes - no != 0.
After you have initialized the test, you need to simulate the test on the null distribution, which we will save as null.
null_dist <- physical %>%
# specify variables
specify(weight ~ physical_3plus) %>%
# assume independence, i.e, there is no difference
hypothesize(null = "independence") %>%
# generate 1000 reps, of type "permute"
generate(reps = 1000, type = "permute") %>%
# calculate statistic of difference, namely "diff in means"
calculate(stat = "diff in means", order = c("yes", "no"))Here, hypothesize is used to set the null hypothesis as a test for independence, i.e., that there is no difference between the two population means. In one sample cases, the null argument can be set to point to test a hypothesis relative to a point estimate.
Also, note that the type argument within generate is set to permute, which is the argument when generating a null distribution for a hypothesis test.
We can visualize this null distribution with the following code:
ggplot(data = null_dist, aes(x = stat)) +
geom_histogram()Now that the test is initialized and the null distribution formed, we can visualise to see how many of these null permutations have a difference of at least obs_stat of 1.77.
We can also calculate the p-value for your hypothesis test using the function infer::get_p_value().
null_dist %>% visualize() +
shade_p_value(obs_stat = obs_diff, direction = "two-sided")null_dist %>%
get_p_value(obs_stat = obs_diff, direction = "two_sided")## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
This the standard workflow for performing hypothesis tests.
Recall the IMBD ratings data. I would like you to explore whether the mean IMDB rating for Steven Spielberg and Tim Burton are the same or not. I have already calculated the confidence intervals for the mean ratings of these two directors and as you can see they overlap.
First, I would like you to reproduce this graph. You may find geom_errorbar() and geom_rect() useful.
In addition, you will run a hypothesis test. You should use both the t.test command and the infer package to simulate from a null distribution, where you assume zero difference between the two.
Before anything, write down the null and alternative hypotheses, as well as the resulting test statistic and the associated t-stat or p-value. At the end of the day, what do you conclude?
H0 = there is no difference between the mean IMBD ratings of Steven Spielberg and Tim Burton H1 = there is a difference between the mean IMBD ratings of Steven Spielberg and Tim Burton
You can load the data and examine its structure
movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies) ## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
Your R code and analysis should go here. If you want to insert a blank chunk of R code you can just hit Ctrl/Cmd+Alt+I
compare2 <- movies %>%
filter(director %in% c("Tim Burton", "Steven Spielberg")) %>%
group_by(director) %>%
summarise(mean = round(mean(rating), 2),
n = count(director),
sd = sd(rating),
t_critical = qt(0.975, n - 1),
se = sd / sqrt(n),
margin_of_error = t_critical * se,
low_CI= round(mean - margin_of_error, 2) ,
high_CI= round(mean + margin_of_error, 2)
)%>%
mutate(director = factor(director, levels = c("Tim Burton","Steven Spielberg")))
graph <- ggplot(compare2, aes(colour=director)) +
geom_errorbar(aes(xmin = low_CI, xmax = high_CI, y= director), width = 0.1, size = 1.5) +
scale_color_manual(values = c("skyblue","tomato"))+
geom_point(aes(x=mean, y=director), size = 3 ) +
labs(title="Do Spielberg and Burton have the same IMDB ratings?",
subtitle="95% confidence intervals overlap",
x="Mean IMDB Rating",
y =" ") +
geom_text(aes(label = low_CI, x=low_CI, y=director), size = 4, color="black", hjust = 1, vjust = 0, nudge_x = 0.05, nudge_y = 0.08) +
geom_text(aes(label = high_CI, x=high_CI, y=director),size = 4, color="black", hjust = 1, vjust = 0, nudge_x = 0.05, nudge_y = 0.08) +
geom_text(aes(label = mean, mean, y=director), size = 6, color="black", hjust = 1, vjust = 0, nudge_x = 0.05, nudge_y = 0.08)+
geom_rect( mapping=aes(xmin= 7.27, xmax= 7.33, ymin=0, ymax=3), color="lightgrey", alpha=0.2) +
theme_bw()
graph +theme(legend.position = "none") set.seed(1234)
#initial observation table:
movies %>%
filter(director %in% c("Steven Spielberg", "Tim Burton")) %>%
group_by(director) %>%
summarise(n = n(),
mean = mean(rating),
sd = sd(rating)) ## # A tibble: 2 × 4
## director n mean sd
## <chr> <int> <dbl> <dbl>
## 1 Steven Spielberg 23 7.57 0.695
## 2 Tim Burton 16 6.93 0.749
compare <- movies %>%
filter(director %in% c("Steven Spielberg", "Tim Burton"))
#CI for both
compare2 <- movies %>%
filter(director %in% c("Tim Burton", "Steven Spielberg")) %>%
group_by(director) %>%
summarise(mean = mean(rating),
n = count(director),
sd = sd(rating),
t_critical = qt(0.975, n - 1),
se = sd / sqrt(n),
margin_of_error = t_critical * se,
low_CI= mean - margin_of_error,
high_CI= mean + margin_of_error,
)
#t-test
t.test(rating ~ director, data = compare)##
## Welch Two Sample t-test
##
## data: rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means between group Steven Spielberg and group Tim Burton is not equal to 0
## 95 percent confidence interval:
## 0.16 1.13
## sample estimates:
## mean in group Steven Spielberg mean in group Tim Burton
## 7.57 6.93
#CI: 0.1596624 1.1256637, t=2.714
obs_diff2 <- compare %>%
specify(rating ~ director) %>%
calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))
obs_diff2## Response: rating (numeric)
## Explanatory: director (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 0.643
#hypothesis testing with infer:
null_dist_movies <- compare %>%
# specify variables
specify(rating ~ director) %>%
# assume independence, i.e, there is no difference
hypothesize(null = "independence") %>%
# generate 1000 reps, of type "permute"
generate(reps = 1000, type = "permute") %>%
# calculate statistic of difference, namely "diff in means"
calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))
#visualise the null distribution
null_dist_movies %>% visualize() +
shade_p_value(obs_stat = obs_diff2, direction = "two-sided")null_dist_movies %>%
get_p_value(obs_stat = obs_diff2, direction = "two_sided")## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.008
At the last board meeting of Omega Group Plc., the headquarters of a large multinational company, the issue was raised that women were being discriminated in the company, in the sense that the salaries were not the same for male and female executives. A quick analysis of a sample of 50 employees (of which 24 men and 26 women) revealed that the average salary for men was about 8,700 higher than for women. This seemed like a considerable difference, so it was decided that a further analysis of the company salaries was warranted.
You are asked to carry out the analysis. The objective is to find out whether there is indeed a significant difference between the salaries of men and women, and whether the difference is due to discrimination or whether it is based on another, possibly valid, determining factor.
omega <- read_csv(here::here("data", "omega.csv"))
glimpse(omega) # examine the data frame## Rows: 50
## Columns: 3
## $ salary <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 635…
## $ gender <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, 3…
skim(omega)| Name | omega |
| Number of rows | 50 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| salary | 0 | 1 | 68717 | 8638.2 | 47033 | 63303.16 | 68847 | 74777.7 | 84576 | ▁▃▇▆▅ |
| experience | 0 | 1 | 14 | 11.9 | 0 | 2.25 | 15 | 20.8 | 44 | ▇▃▅▂▁ |
The data frame omega contains the salaries for the sample of 50 executives in the company. Can you conclude that there is a significant difference between the salaries of the male and female executives?
Note that you can perform different types of analyses, and check whether they all lead to the same conclusion
. Confidence intervals . Hypothesis testing . Correlation analysis . Regression
Calculate summary statistics on salary by gender. Also, create and print a dataframe where, for each gender, you show the mean, SD, sample size, the t-critical, the SE, the margin of error, and the low/high endpoints of a 95% condifence interval
# Summary Statistics of salary by gender
mosaic::favstats (salary ~ gender, data=omega)## gender min Q1 median Q3 max mean sd n missing
## 1 female 47033 60338 64618 70033 78800 64543 7567 26 0
## 2 male 54768 68331 74675 78568 84576 73239 7463 24 0
# Dataframe with two rows (male-female) and having as columns gender, mean, SD, sample size, the t-critical value, the standard error, the margin of error, and the low/high endpoints of a 95% condifence interval
sumstat <- omega %>%
group_by(gender) %>%
summarise(mean = mean(salary),
n = count(gender),
sd = sd(salary),
t_critical = qt(0.975, n - 1),
se = sd / sqrt(n),
margin_of_error = t_critical * se,
low_CI= mean - margin_of_error,
high_CI= mean + margin_of_error,
)What can you conclude from your analysis? A couple of sentences would be enough
The confidence intervals for the mean salaries by gender do not overlap meaning we have statistically sufficient evidence to conclude that there is a difference between in mean salary by gender. We would assume that in running a hypothesis test, our p-value will be less than 0.05 but we will run the hypothesis test to double-check.
You can also run a hypothesis testing, assuming as a null hypothesis that the mean difference in salaries is zero, or that, on average, men and women make the same amount of money. You should tun your hypothesis testing using t.test() and with the simulation method from the infer package.
# hypothesis testing using t.test()
t.test(salary ~ gender, data = omega)##
## Welch Two Sample t-test
##
## data: salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -12973 -4420
## sample estimates:
## mean in group female mean in group male
## 64543 73239
# hypothesis testing using infer package
set.seed(1234)
null_dist_salary <- omega %>%
# specify variables
specify(salary ~ gender) %>%
# assume independence, i.e, there is no difference
hypothesize(null = "independence") %>%
# generate 1000 reps, of type "permute"
generate(reps = 1000, type = "permute") %>%
# calculate statistic of difference, namely "diff in means"
calculate(stat = "diff in means", order = c("female", "male"))
obs_diff3 <- omega %>%
specify(salary ~ gender) %>%
calculate(stat = "diff in means", order = c("female", "male"))
obs_diff3## Response: salary (numeric)
## Explanatory: gender (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 -8696.
#visualise the null distribution
null_dist_salary %>% visualize() +
shade_p_value(obs_stat = obs_diff3, direction = "two-sided")null_dist_salary %>%
get_p_value(obs_stat = obs_diff3, direction = "two_sided")## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
What can you conclude from your analysis? A couple of sentences would be enough
Though the infer package returns a 0 p_value, it is impossible to have a p value of 0. Infer package sometimes returns p_values of 0 when the p-value is very small. When running the t-test, we got a p-value of 2e-04 which is very low and it confirms our hypothesis that inferpackage rounded down the p-value to 0. In conclusion the null hypothesis is rejected and that there is a difference between the salaries of men and women.
At the board meeting, someone raised the issue that there was indeed a substantial difference between male and female salaries, but that this was attributable to other reasons such as differences in experience. A questionnaire send out to the 50 executives in the sample reveals that the average experience of the men is approximately 21 years, whereas the women only have about 7 years experience on average (see table below).
# Summary Statistics of salary by gender
favstats (experience ~ gender, data=omega)## gender min Q1 median Q3 max mean sd n missing
## 1 female 0 0.25 3.0 14.0 29 7.38 8.51 26 0
## 2 male 1 15.75 19.5 31.2 44 21.12 10.92 24 0
Based on this evidence, can you conclude that there is a significant difference between the experience of the male and female executives? Perform similar analyses as in the previous section. Does your conclusion validate or endanger your conclusion about the difference in male and female salaries?
H0 = There is no difference between the mean experience of males and females. mean(f)-mean(m) = 0 HA = There is a difference between the mean experience of males and females. mean(f)-mean(m)!= 0
#manual CI calculation:
sumstatExp <- omega %>%
group_by(gender) %>%
summarise(mean = mean(experience),
n = count(gender),
sd = sd(experience),
t_critical = qt(0.975, n - 1),
se = sd / sqrt(n),
margin_of_error = t_critical * se,
low_CI= mean - margin_of_error,
high_CI= mean + margin_of_error,
)
#t-test:
t.test(experience ~ gender, data = omega)##
## Welch Two Sample t-test
##
## data: experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -19.35 -8.13
## sample estimates:
## mean in group female mean in group male
## 7.38 21.12
Confidence intervals do not overlap and t-test gives a p value of 1e-05, so that we can reject the null hypothesis and there is a statistically sufficient evidence to conclude that there is a difference between the mean experience of men and women at work.
Someone at the meeting argues that clearly, a more thorough analysis of the relationship between salary and experience is required before any conclusion can be drawn about whether there is any gender-based salary discrimination in the company.
Analyse the relationship between salary and experience. Draw a scatterplot to visually inspect the data
ggplot(omega, aes(x=experience, y=salary )) +
geom_point()+
geom_smooth()+
labs(title="Scatterplot of Experience vs Salary")+
theme_bw() There’s a positive relationship between experience and salary up until 30 years, and then the trend line seems to flatten.
You can use GGally:ggpairs() to create a scatterplot and correlation matrix. Essentially, we change the order our variables will appear in and have the dependent variable (Y), salary, as last in our list. We then pipe the dataframe to ggpairs() with aes arguments to colour by gender and make ths plots somewhat transparent (alpha = 0.3).
omega %>%
select(gender, experience, salary) %>% #order variables they will appear in ggpairs()
ggpairs(aes(colour=gender, alpha = 0.3))+
theme_bw()Look at the salary vs experience scatterplot. What can you infer from this plot? Explain in a couple of sentences
Women have a lot less experience and less salary and females have a stronger correlation between their experience and salary than males. Whereas males have more experience but there is less correlation between salary and experience. It is interesting that women do not work in the company as long. We could look into why women have less experience at the company. Is it because they are having kids? or is it because they are not being promoted to management roles because they are women?
Using your data manipulation and visualisation skills, please use the Brexit results dataframe (the same dataset you used in the pre-programme assignement) and produce the following plot. Use the correct colour for each party; google “UK Political Party Web Colours” and find the appropriate hex code for colours, not the default colours that R gives you.
brexit <- read_csv(here::here("data", "brexit_results.csv"))glimpse(brexit)## Rows: 632
## Columns: 11
## $ Seat <chr> "Aldershot", "Aldridge-Brownhills", "Altrincham and Sale W…
## $ con_2015 <dbl> 50.6, 52.0, 53.0, 44.0, 60.8, 22.4, 52.5, 22.1, 50.7, 53.0…
## $ lab_2015 <dbl> 18.3, 22.4, 26.7, 34.8, 11.2, 41.0, 18.4, 49.8, 15.1, 21.3…
## $ ld_2015 <dbl> 8.82, 3.37, 8.38, 2.98, 7.19, 14.83, 5.98, 2.42, 10.62, 5.…
## $ ukip_2015 <dbl> 17.87, 19.62, 8.01, 15.89, 14.44, 21.41, 18.82, 21.76, 19.…
## $ leave_share <dbl> 57.9, 67.8, 38.6, 65.3, 49.7, 70.5, 59.9, 61.8, 51.8, 50.3…
## $ born_in_uk <dbl> 83.1, 96.1, 90.5, 97.3, 93.3, 97.0, 90.5, 90.7, 87.0, 88.8…
## $ male <dbl> 49.9, 48.9, 48.9, 49.2, 48.0, 49.2, 48.5, 49.2, 49.5, 49.5…
## $ unemployed <dbl> 3.64, 4.55, 3.04, 4.26, 2.47, 4.74, 3.69, 5.11, 3.39, 2.93…
## $ degree <dbl> 13.87, 9.97, 28.60, 9.34, 18.78, 6.09, 13.12, 7.90, 17.80,…
## $ age_18to24 <dbl> 9.41, 7.33, 6.44, 7.75, 5.73, 8.21, 7.82, 8.94, 7.56, 7.61…
head(brexit)## # A tibble: 6 × 11
## Seat con_2015 lab_2015 ld_2015 ukip_2015 leave_share born_in_uk male
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aldershot 50.6 18.3 8.82 17.9 57.9 83.1 49.9
## 2 Aldridge-Bro… 52.0 22.4 3.37 19.6 67.8 96.1 48.9
## 3 Altrincham a… 53.0 26.7 8.38 8.01 38.6 90.5 48.9
## 4 Amber Valley 44.0 34.8 2.98 15.9 65.3 97.3 49.2
## 5 Arundel and … 60.8 11.2 7.19 14.4 49.7 93.3 48.0
## 6 Ashfield 22.4 41.0 14.8 21.4 70.5 97.0 49.2
## # … with 3 more variables: unemployed <dbl>, degree <dbl>, age_18to24 <dbl>
brexit <- brexit %>%
select(c("con_2015", "lab_2015", "ld_2015", "ukip_2015", "leave_share"))
names(brexit) <- c("Conservative" ,"Labour", "Lib_Dems", "UKIP", "Leave_Share")
long_brexit <- pivot_longer(brexit, c("Conservative" ,"Labour", "Lib_Dems", "UKIP"), names_to="party")
ggplot(long_brexit, aes(x = value, y = Leave_Share, color=party ) ) +
geom_smooth(size=0.7, method = "lm" ) +
geom_point(size=0.7, alpha = 0.3) +
labs( title = "How political affiliation translated to Brexit voting",
x = "Part % in the UK 2015 general election",
y = "Leave share in the UK 2016 general election") +
theme_light() +
theme(legend.position = "bottom") +
scale_colour_manual(labels = c("Conservative" ,"Labour", "Lib_Dems", "UKIP", "Leave_Share") ,
values = c("#0087DC","#E4003B", "#FAA61A", "#EFE600")) #Challenge 2: CDC COVID-19 Public Use Data
The CDC Covid-19 Case Surveillance Data is a case surveillance public use dataset with 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, presence of any underlying medical conditions and risk behaviors. You can see the variables from
There are well over 28 million entries of individual, and we will work with SQLlite database, rather than a CSV file. I would like you to produce two graphs that show death % rate:
To do this, you will have to think what dplyr verbs to use to select, filter, group_by, etc. You will then use the example shown in https://mam2022.netlify.app/reference/reference_sql/#establish-a-connection-with-the-sqlite-database-1 to use dplyr, dbplyr, and ggplot to produce these graphs.
At the risk of oversimplifying things, the main components of gross domestic product, GDP are personal consumption (C), business investment (I), government spending (G) and net exports (exports - imports). You can read more about GDP and the different approaches in calculating at the Wikipedia GDP page.
The GDP data we will look at is from the United Nations’ National Accounts Main Aggregates Database, which contains estimates of total GDP and its components for all countries from 1970 to today. We will look at how GDP and its components have changed over time, and compare different countries and how much each component contributes to that country’s GDP. The file we will work with is GDP and its breakdown at constant 2010 prices in US Dollars and it has already been saved in the Data directory. Have a look at the Excel file to see how it is structured and organised
UN_GDP_data <- read_excel(here::here("data", "Download-GDPconstant-USD-countries.xls"), # Excel filename
sheet="Download-GDPconstant-USD-countr", # Sheet name
skip=2) # Number of rows to skipThe first thing you need to do is to tidy the data, as it is in wide format and you must make it into long, tidy format. Please express all figures in billions (divide values by 1e9, or \(10^9\)), and you want to rename the indicators into something shorter.
make sure you remove
eval=FALSEfrom the next chunk of R code– I have it there so I could knit the document
tidy_GDP_data <- UN_GDP_data %>%
pivot_longer(cols = 4:51,
names_to = "year") %>%
mutate(value=value/1e9)
glimpse(tidy_GDP_data)## Rows: 176,880
## Columns: 5
## $ CountryID <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
## $ IndicatorName <chr> "Final consumption expenditure", "Final consumption expe…
## $ year <chr> "1970", "1971", "1972", "1973", "1974", "1975", "1976", …
## $ value <dbl> 5.56, 5.33, 5.20, 5.75, 6.15, 6.32, 6.37, 6.90, 7.09, 6.…
# Let us compare GDP components for these 3 countries
country_list <- c("United States","India", "Germany")First, can you produce this plot?
skimr::skim(tidy_GDP_data)| Name | tidy_GDP_data |
| Number of rows | 176880 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Country | 0 | 1 | 4 | 34 | 0 | 220 | 0 |
| IndicatorName | 0 | 1 | 17 | 88 | 0 | 17 | 0 |
| year | 0 | 1 | 4 | 4 | 0 | 48 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CountryID | 0 | 1.00 | 439.2 | 254 | 4 | 214.00 | 440.0 | 660.0 | 894 | ▇▇▇▇▆ |
| value | 15421 | 0.91 | 72.2 | 447 | -568 | 0.36 | 2.5 | 17.9 | 17349 | ▇▁▁▁▁ |
unique(tidy_GDP_data$IndicatorName)## [1] "Final consumption expenditure"
## [2] "Household consumption expenditure (including Non-profit institutions serving households)"
## [3] "General government final consumption expenditure"
## [4] "Gross capital formation"
## [5] "Gross fixed capital formation (including Acquisitions less disposals of valuables)"
## [6] "Exports of goods and services"
## [7] "Imports of goods and services"
## [8] "Gross Domestic Product (GDP)"
## [9] "Agriculture, hunting, forestry, fishing (ISIC A-B)"
## [10] "Mining, Manufacturing, Utilities (ISIC C-E)"
## [11] "Manufacturing (ISIC D)"
## [12] "Construction (ISIC F)"
## [13] "Wholesale, retail trade, restaurants and hotels (ISIC G-H)"
## [14] "Transport, storage and communication (ISIC I)"
## [15] "Other Activities (ISIC J-P)"
## [16] "Total Value Added"
## [17] "Changes in inventories"
indicator_list <- c("Gross capital formation",
"Exports of goods and services",
"Imports of goods and services",
"Household consumption expenditure (including Non-profit institutions serving households)",
"General government final consumption expenditure")library(scales)
country_list_gdp <- tidy_GDP_data%>%
filter(Country %in% country_list,IndicatorName %in% indicator_list )
ggplot(country_list_gdp,aes(x = as.numeric(year), group = IndicatorName))+
geom_line(aes(y = value, color = IndicatorName))+
facet_wrap(~Country)+
theme_bw()+
theme(legend.position = "right")+
scale_x_continuous(breaks = scales::pretty_breaks(n=5))+
labs(title="GDP components over time",
subtitle="In constant 2010 USD",
x="",
y="Billion US$")+
scale_color_discrete(labels = c("Gross capital formation",
"Exports",
"Goverment Expenditure",
"Household expenditure",
"Imports")
)Secondly, recall that GDP is the sum of Household Expenditure (Consumption C), Gross Capital Formation (business investment I), Government Expenditure (G) and Net Exports (exports - imports). Even though there is an indicator Gross Domestic Product (GDP) in your dataframe, I would like you to calculate it given its components discussed above.
What is the % difference between what you calculated as GDP and the GDP figure included in the dataframe?
tidy_gdp_wider <- tidy_GDP_data%>%
pivot_wider(names_from = IndicatorName, values_from = value)
tidy_gdp_wider <- tidy_gdp_wider%>%
mutate(net_export = tidy_gdp_wider[[indicator_list[2]]] - tidy_gdp_wider[[indicator_list[3]]])
tidy_gdp_wider <- tidy_gdp_wider%>%
mutate(calc_gdp = tidy_gdp_wider[[indicator_list[1]]]+
tidy_gdp_wider[[indicator_list[4]]]+
tidy_gdp_wider[[indicator_list[5]]]+
tidy_gdp_wider[["net_export"]],
prc_of_gdp_gross_capital = tidy_gdp_wider[[indicator_list[1]]]/calc_gdp * 100,
prc_of_gdp_house = tidy_gdp_wider[[indicator_list[4]]]/calc_gdp * 100,
prc_of_gdp_gov = tidy_gdp_wider[[indicator_list[5]]]/calc_gdp * 100,
prc_of_gdp_net_exp = tidy_gdp_wider[["net_export"]]/calc_gdp * 100)
indicator_list_2 <- c("Gross capital formation",
"Household consumption expenditure (including Non-profit institutions serving households)",
"General government final consumption expenditure",
"net_export")
calc_gdp_longer <- tidy_gdp_wider%>%
pivot_longer(cols = 22:26 , names_to = "IndicatorName")
calc_gdp_longer%>%
filter(Country %in% country_list)%>%
filter(IndicatorName %in% c("prc_of_gdp_gross_capital","prc_of_gdp_house","prc_of_gdp_gov","prc_of_gdp_net_exp"))%>%
ggplot(aes(x = as.numeric(year), group = IndicatorName))+
geom_line(aes(y = value, color = IndicatorName))+
facet_wrap(~Country)+
theme_bw()+
theme(legend.position = "right")+
scale_color_discrete(labels = c("Goverment Expenditure",
"Gross Capital Formation",
"Gross Household Expenditure",
"Net Export")
)+
scale_x_continuous(breaks = scales::pretty_breaks(n=5))+
labs(title="GDP and its breakdown at constant 2010 prices in US Dollars",
x="",
y="Billion US$")+
scale_y_continuous(labels = function(y) sprintf("%.1f %%",y))tidy_gdp_wider<- tidy_gdp_wider%>%
mutate(calc_vs_real_gdp = calc_gdp / tidy_gdp_wider[["Gross Domestic Product (GDP)"]]-1 )What is the % difference between what you calculated as GDP and the GDP figure included in the dataframe? TEAM ANSWER:
The percentage difference changes from year to year depending on how much the factors within the table that we did not use in our calculation contributed to the total GDP.
What is this last chart telling you? Can you explain in a couple of paragraphs the different dynamic among these three countries? TEAM ANSWER:
All three countries have highest proportion of GDP in Household Expenditure and lowest proportion of GDP in Net Exports.
India has the highest household consumption proportion compared to Germany and the United States, yet it is decreasing over the years. In contrast, gross capital formation proportion increased significantly over the years. This might be due to India being a developing country where incomes flows towards investment capital consumption from household consumption capital.
Germany and the United states showcase similar pattern of all proportion being relatively stable over the years. Both countries are developed countries with stable development, which explains the relatively less volatile proportion changes.
If you want to, please change
country_list <- c("United States","India", "Germany")to include your own country and compare it with any two other countries you like
There is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.
Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!
As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?
Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.
Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).
Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.